
Site/komipo 20251211#166

Open
seongmincho315 wants to merge 20 commits into develop from site/komipo-20251211

Conversation


@seongmincho315 seongmincho315 commented Apr 6, 2026

<Additions>

  1. Attachment preprocessor
  • Added a workflow guardrail
  • Run OCR when the document contains no glyphs at all
  2. Added a OneAgent integration preprocessor (based on the attachment preprocessor)
  3. Consolidated the JSON preprocessors (KOMIPO PMS: TM, routine work orders, generation stoppage)
  • Chunking is now applied to text inside the JSON as well
  4. Intelligent preprocessor

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Summary by CodeRabbit

  • New Features

    • Added content safety guardrail processing for document chunks
    • Introduced multi-format document processing with automatic file type detection
    • Enhanced chunking algorithms with token-aware optimization
    • Expanded support for various file formats including audio transcription
  • Bug Fixes

    • Fixed conditional OCR processing for HTML-origin documents
  • Chores

    • Updated container image versions to 1.3.7-komipo and 1.3.3-komipo
    • Disabled image publishing in build configuration

@seongmincho315 seongmincho315 requested a review from inoray April 6, 2026 04:34

coderabbitai Bot commented Apr 16, 2026

📝 Walkthrough

This PR introduces comprehensive document preprocessing infrastructure by adding multiple specialized processor modules (OCR, JSON, OneAgent, attachment handling) with guardrail safety checks, updates build configurations and Docker image versions to 1.3.7-komipo and 1.3.3-komipo, adds foundational utilities for chunking and metadata management, and modifies deployment scripts for image registration workflows.

Changes

Cohort / File(s) / Summary

  • Build Configuration Updates — build-script/doc-parser-build.config, build-script/paddle-ocr-build.config, genon/preprocessor/scripts/register.config
    Updated Docker image versions: doc-parser to 1.3.7-komipo, paddle-ocr to 1.3.7-komipo, and preprocessor to 1.3.3-komipo. Disabled image publishing in the doc-parser config and removed container runtime settings.
  • Documentation Updates — genon/README.md
    Added versioned Docker image export/restore instructions for both the preprocessor (1.3.3-komipo) and OCR (1.3.3-komipo) images with docker save/docker load workflows.
  • Core Preprocessor Modules — genon/preprocessor/facade/attachment_processor.py, genon/preprocessor/facade/attachment_processor_guardrail.py, genon/preprocessor/facade/intelligent_processor_ocr.py, genon/preprocessor/facade/json_processor.py, genon/preprocessor/facade/oneagent_processor.py, genon/preprocessor/facade/legacy/적재용(외부)_ocr.py
    Implemented comprehensive document processing pipelines with guardrail safety filtering, OCR handling, JSON/tabular/audio format support, and document-to-vector conversion. Added format-specific loaders (HWP, text, audio, tabular) with chunking strategies. Modified the legacy OCR processor to conditionally skip table-cell OCR for HTML origins.
  • Module-Level Infrastructure — genon/preprocessor/module/base_processor.py, genon/preprocessor/module/intelligent_processor.py, genon/preprocessor/module/test.py, genon/preprocessor/module/test_processor.py
    Added a foundational BaseProcessor class with format option building and document conversion orchestration. Implemented DocumentProcessor for full ingestion pipelines with configurable pipelines and OCR support. Created a test harness with async processing and result serialization.
  • Chunking & Metadata Utilities — genon/preprocessor/module/utils/chunkers.py, genon/preprocessor/module/utils/metadata.py, genon/preprocessor/module/utils/genos_util.py, genon/preprocessor/module/utils/util.py
    Introduced GenosBucketChunker and SimpleChunker for layout-aware and size-based document splitting with token counting. Added GenOSVectorMeta and GenOSVectorMetaBuilder for rich chunk metadata assembly. Implemented file-to-PDF conversion, magic-header file type detection, and exception handling utilities.
  • Deployment Script Updates — genon/preprocessor/scripts/register_image.sh
    Refactored the image registration workflow: commented out local Docker image checks, changed the resource group ID from 2 to 1, replaced direct MySQL queries with kubectl exec for MariaDB pod access, and streamlined error handling in image ID extraction.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • Task/165 vlm table #169: Introduces layout and table-structure model integrations (LayoutModelType, TableStructureModelType, GenosDotsOCR, GenosVlmTableStructure) that are utilized by the newly added OCR and intelligent processor modules in this PR.

Suggested reviewers

  • HeechanKim-Genon

Poem

🐰 Hops through pipelines with glee,
Chunks and guards each doc with care,
JSON, audio, images fair—
Metadata blooms in every tree,
Now safety filters all we see!

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 40.07%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check — ❓ Inconclusive: The title 'Site/komipo 20251211' is vague and generic, using non-descriptive terms that don't convey meaningful information about the changeset's primary purpose. Resolution: revise the title to clearly describe the main change, such as 'Add attachment, JSON, and intelligent preprocessors with guardrail support' or 'Implement modular document preprocessing pipeline with guardrail validation'.

✅ Passed checks (1 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 15

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
genon/preprocessor/scripts/register_image.sh (2)

61-73: ⚠️ Potential issue | 🔴 Critical

Critical: HAS_LOCAL_IMAGE is undefined, breaking push logic.

The local image check (lines 61-68) is commented out, but line 73 still references HAS_LOCAL_IMAGE. Since this variable is never set, the condition [[ "${HAS_LOCAL_IMAGE}" != "yes" ]] will always evaluate to true (unset variables expand to empty string), causing the script to always skip to the registry check or user prompt path, never proceeding with a normal push.

Either uncomment the local image check or remove the HAS_LOCAL_IMAGE condition entirely.
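The unset-variable behavior described above is easy to reproduce in isolation; a minimal sketch (not the script itself):

```shell
# An unset variable expands to the empty string, so the guard always fires.
unset HAS_LOCAL_IMAGE
if [[ "${HAS_LOCAL_IMAGE}" != "yes" ]]; then
  echo "guard fired"
fi
```

This is why the push path is never taken regardless of whether the image exists locally.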

🐛 Option 1: Uncomment the local image check
-# # ── 로컬 이미지 확인 ────────────────────────────────────────
-# step "로컬 Docker 이미지 확인"
-# if docker images | awk '{print $1":"$2}' | grep -qx "${FULL_IMAGE_NAME}"; then
-#   ok "로컬 이미지 존재"
-# else
-#   fail "로컬에 ${FULL_IMAGE_NAME} 없음. 먼저 build/push 하세요."
-#   exit 1
-# fi
+# ── 로컬 이미지 확인 ────────────────────────────────────────
+step "로컬 Docker 이미지 확인"
+HAS_LOCAL_IMAGE="no"
+if docker images | awk '{print $1":"$2}' | grep -qx "${FULL_IMAGE_NAME}"; then
+  ok "로컬 이미지 존재"
+  HAS_LOCAL_IMAGE="yes"
+else
+  echo "⚠️ 로컬에 ${FULL_IMAGE_NAME} 없음"
+fi
🐛 Option 2: Remove the HAS_LOCAL_IMAGE condition
 # ── docker push (포그라운드 / 재시도) ───────────────────────
 step "docker push"
 SKIP_PUSH="no"
-if [[ "${HAS_LOCAL_IMAGE}" != "yes" ]]; then
+# Check if image exists in registry before attempting push
+if ! docker image inspect "${FULL_IMAGE_NAME}" >/dev/null 2>&1; then
   if [[ -n "${REGISTRY_API_URL:-}" ]]; then
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/scripts/register_image.sh` around lines 61 - 73, The
script references HAS_LOCAL_IMAGE in the "docker push" block but the local image
check that sets it (which inspects FULL_IMAGE_NAME) is commented out, so
HAS_LOCAL_IMAGE is never defined; restore the local-image check or explicitly
set HAS_LOCAL_IMAGE before the push logic (e.g., run the docker images | awk ...
| grep -qx "${FULL_IMAGE_NAME}" test and set HAS_LOCAL_IMAGE="yes"/"no") so the
condition in the step "docker push" behaves correctly, or alternatively remove
the HAS_LOCAL_IMAGE conditional entirely and rely on the registry/user prompt
flow.

154-171: ⚠️ Potential issue | 🔴 Critical

Critical: SQL INSERT is never executed.

SQL_INSERT is assigned but mysql_query is never called to execute it. The subsequent SELECT (lines 165-169) queries for a row that was never inserted, returning empty. The script will incorrectly report "DB 등록 완료" with an empty IMAGE_ID.

🐛 Proposed fix: Execute the INSERT before querying
   SQL_INSERT="
       INSERT INTO llmops.system_docker_image_tb
         (name, tag, description, type, status, is_active, reg_date, mod_date, reg_user_id, mod_user_id)
       VALUES
         ('${IMAGE_NAME}', '${IMAGE_TAG}', '${DESCRIPTION}', '${TYPE_LIST_JSON}', 'COMPLETED', 1, NOW(), NOW(), 1, 1);
       INSERT INTO llmops.resource_meta_tb
         (resource_id, resource_type, resource_group_id, is_active, reg_date, mod_date, reg_user_id, mod_user_id)
       VALUES
         (LAST_INSERT_ID(), 'DOCKER_IMAGE', 1, 1, NOW(), NOW(), 1, 1);
-    " 2>/dev/null
+    "
+
+  if ! mysql_query "${SQL_INSERT}"; then
+    fail "DB INSERT 실패"
+    exit 1
+  fi

   IMAGE_ID=$(
     kubectl exec -it "${MARIADB_POD}" -n "${K8S_NAMESPACE}" -- \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/scripts/register_image.sh` around lines 154 - 171, The
SQL_INSERT variable is never executed, so the SELECT that populates IMAGE_ID
finds nothing; run the SQL_INSERT against the MariaDB before performing the
SELECT (e.g. execute SQL_INSERT via the same kubectl exec mysql invocation or
pipe it into mysql) so the INSERT into system_docker_image_tb and
resource_meta_tb actually occur; keep using the same kubectl exec command
pattern (referencing SQL_INSERT, kubectl exec -it "${MARIADB_POD}" -n
"${K8S_NAMESPACE}" -- mysql -u "${MYSQL_USER}" -p"${MYSQL_PASS}" llmops -se) and
then run the SELECT to set IMAGE_ID, and preserve/handle any errors from the
mysql invocation.
🟠 Major comments (20)
genon/preprocessor/scripts/register.config-4-4 (1)

4-4: ⚠️ Potential issue | 🟠 Major

Version mismatch with build configuration.

The registration config specifies 1.3.3-komipo, but build-script/doc-parser-build.config builds version 1.3.7-komipo. This inconsistency will cause registration to reference a non-existent or outdated image version. Align these versions to prevent deployment failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/scripts/register.config` at line 4, The IMAGE_TAG value in
register.config ("IMAGE_TAG=\"1.3.3-komipo\"") is out of sync with the
build-script/doc-parser-build.config which builds "1.3.7-komipo"; update the
IMAGE_TAG constant in genon/preprocessor/scripts/register.config to
"1.3.7-komipo" (or alternatively update the build config so both use the same
canonical version) so registration references the exact image version produced
by doc-parser-build.config.
genon/preprocessor/module/utils/util.py-143-146 (1)

143-146: ⚠️ Potential issue | 🟠 Major

Don't return a PDF path when no PDF was created.

If WeasyPrint is unavailable, this function still returns pdf_path even though nothing was written. Callers will treat that as a successful conversion and fail later on missing-file access.

Proposed fix
     html_content = markdown(md_content)
-    if HTML:
-        HTML(string=html_content).write_pdf(pdf_path)
-    return pdf_path
+    if HTML is None:
+        return None
+    HTML(string=html_content).write_pdf(pdf_path)
+    return pdf_path
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/module/utils/util.py` around lines 143 - 146, The function
currently returns pdf_path even when WeasyPrint/HTML isn't available; update the
logic around HTML(string=html_content).write_pdf(pdf_path) so you only return
pdf_path after confirming a PDF was actually created: if HTML is falsy or
write_pdf wasn't called/failed, raise a clear exception (e.g., RuntimeError) or
return None, and after calling HTML(...).write_pdf(pdf_path) verify the file
exists (os.path.exists(pdf_path)) before returning. Refer to the HTML symbol and
the pdf_path variable in util.py to locate and change the post-conversion return
behavior.
genon/preprocessor/module/base_processor.py-125-126 (1)

125-126: ⚠️ Potential issue | 🟠 Major

Honor the configured chunker instead of hardcoding "simple".

The constructor ignores config["chunker"], so this new config surface is already ineffective and any future non-simple processor will silently use the wrong chunker.

Proposed fix
-        self.chunker = CHUNKERS["simple"]()
+        chunker_name = self.config.get("chunker", "simple")
+        self.chunker = CHUNKERS[chunker_name]()
         self.genos_meta_builder = GenOSVectorMetaBuilder()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/module/base_processor.py` around lines 125 - 126, The
constructor currently hardcodes self.chunker = CHUNKERS["simple"](), ignoring
the provided config["chunker"]; change it to read the chunker name from the
config (e.g., chunker_name = config.get("chunker", "simple")), validate that
chunker_name exists in CHUNKERS, and then instantiate with self.chunker =
CHUNKERS[chunker_name](); if the name is invalid, either fall back to "simple"
and log a warning or raise a clear error so the configured chunker option in the
BaseProcessor constructor is honored.
genon/preprocessor/module/utils/util.py-56-73 (1)

56-73: ⚠️ Potential issue | 🟠 Major

Handle the PDF generated from the ASCII temp name.

When cand is the ASCII copy, LibreOffice writes <ascii_name>.pdf into out_dir, but this code only checks for pdf_path based on the original filename. Non-ASCII inputs can therefore convert successfully and still return None.

Proposed fix
         for cand in candidates:
             cmd = [
                 "soffice",
@@
             ]
             proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
-            if proc.returncode == 0 and pdf_path.exists():
+            generated_pdf = out_dir / f"{cand.stem}.pdf"
+            if proc.returncode == 0 and generated_pdf.exists():
+                if generated_pdf != pdf_path:
+                    generated_pdf.replace(pdf_path)
                 # 성공
                 if tmp_dir:
                     shutil.rmtree(tmp_dir, ignore_errors=True)
                 return str(pdf_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/module/utils/util.py` around lines 56 - 73, The conversion
loop currently only checks pdf_path (based on the original filename) so when
LibreOffice writes an ASCII-named PDF for the candidate `cand` the code misses
it; after running subprocess.run(cmd, ...) inside the for cand in candidates
loop, check for the actual generated PDF in out_dir using the candidate's output
name (e.g. generated = out_dir / f"{cand.stem}.pdf") and/or glob for any PDF
that matches the candidate stem, and if that file exists and proc.returncode ==
0 then remove tmp_dir (if present) and return its string path; keep the existing
stderr logging on failure and still cleanup tmp_dir on success.
genon/preprocessor/module/utils/util.py-94-97 (1)

94-97: ⚠️ Potential issue | 🟠 Major

Stop rewriting the entire path string here.

str.replace() will also rewrite matching text in parent directories and basename substrings, not just the suffix. That can produce invalid output paths for files under directories like .../md.archive/....
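The failure mode is easy to demonstrate with a path whose directory name happens to contain the extension:

```python
from pathlib import Path

path = "/data/old.md/report.md"

# str.replace rewrites every occurrence, corrupting the directory name:
naive = path.replace(".md", ".pdf")
print(naive)  # /data/old.pdf/report.pdf

# Path.with_suffix touches only the final extension:
safe = str(Path(path).with_suffix(".pdf"))
print(safe)   # /data/old.md/report.pdf
```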

Proposed fix
 def _get_pdf_path(file_path: str, CONVERTIBLE_EXTENSIONS: list) -> str:
@@
-    pdf_path = file_path
-    for ext in CONVERTIBLE_EXTENSIONS:
-        pdf_path = pdf_path.replace(ext, ".pdf")
-    return pdf_path
+    path = Path(file_path)
+    if path.suffix.lower() in CONVERTIBLE_EXTENSIONS:
+        return str(path.with_suffix(".pdf"))
+    return str(path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/module/utils/util.py` around lines 94 - 97, The current
loop in util.py uses pdf_path = pdf_path.replace(ext, ".pdf") which replaces
every occurrence of ext in the whole path; instead detect and replace only the
file extension/suffix. Update the logic that computes pdf_path (using file_path,
CONVERTIBLE_EXTENSIONS and pdf_path) to check the path's suffix via
os.path.splitext or pathlib.Path.suffix and only swap the final extension to
".pdf" (or use Path.with_suffix(".pdf")) when the file's extension matches a
member of CONVERTIBLE_EXTENSIONS, leaving directory names untouched.
genon/preprocessor/module/utils/metadata.py-295-298 (1)

295-298: ⚠️ Potential issue | 🟠 Major

Guard appendix string parsing against non-JSON input.

This branch assumes every non-empty string is a JSON array. A plain value like "appendix-1.pdf" will raise JSONDecodeError and abort vector composition.
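A standalone sketch of the guarded parsing (the function name is illustrative, not the module's API):

```python
import json

def parse_appendix(appendix_info):
    # Treat a non-JSON string as a single appendix entry instead of failing.
    if not appendix_info:
        return []
    try:
        parsed = json.loads(appendix_info)
    except json.JSONDecodeError:
        parsed = [appendix_info]
    return [item.strip() for item in parsed if isinstance(item, str) and item.strip()]

print(parse_appendix('["a.pdf", " b.pdf "]'))  # ['a.pdf', 'b.pdf']
print(parse_appendix('appendix-1.pdf'))        # ['appendix-1.pdf']
```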

Proposed fix
         appendix_list = []
         if isinstance(appendix_info, str):
-            appendix_list = (
-                [item.strip() for item in json.loads(appendix_info) if item.strip()] if appendix_info else []
-            )
+            if appendix_info:
+                try:
+                    parsed = json.loads(appendix_info)
+                except json.JSONDecodeError:
+                    parsed = [appendix_info]
+                appendix_list = [item.strip() for item in parsed if isinstance(item, str) and item.strip()]
         elif isinstance(appendix_info, list):
             appendix_list = appendix_info
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/module/utils/metadata.py` around lines 295 - 298, The code
assumes appendix_info (string) is JSON and calls json.loads directly; wrap the
parsing for appendix_info in a try/except to guard against non-JSON strings:
attempt json.loads(appendix_info) and if it raises JSONDecodeError (or
ValueError), fall back to treating the trimmed appendix_info as a single entry
(appendix_list = [appendix_info.strip()] if appendix_info.strip() else []);
preserve the existing empty-string behavior and keep the variable name
appendix_list and the branch for isinstance(appendix_info, str).
genon/preprocessor/module/test_processor.py-6-6 (1)

6-6: ⚠️ Potential issue | 🟠 Major

Use package-relative imports and add missing __init__.py files.

The absolute import from base_processor import BaseProcessor works only when this directory is manually added to sys.path (as shown in test.py line 10). Importing this module through the package path will raise ModuleNotFoundError. Convert to relative imports and ensure the package is properly structured with __init__.py files in genon/preprocessor/ and genon/preprocessor/module/.

Proposed fix
-from base_processor import BaseProcessor
+from .base_processor import BaseProcessor

Also add __init__.py (can be empty) to:

  • genon/preprocessor/__init__.py
  • genon/preprocessor/module/__init__.py
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/module/test_processor.py` at line 6, Change the absolute
import in test_processor.py to a package-relative import (e.g. in
test_processor.py use "from ..base_processor import BaseProcessor") so the
module can be imported via the package; also add empty __init__.py files to the
genon.preprocessor package and the genon.preprocessor.module subpackage (create
__init__.py in genon/preprocessor/ and genon/preprocessor/module/) to make the
package importable.
genon/preprocessor/facade/json_processor.py-89-105 (1)

89-105: ⚠️ Potential issue | 🟠 Major

Keep the JSON vector schema aligned with the shared metadata contract.

This builder emits bboxes/url, while the shared implementation in genon/preprocessor/module/utils/metadata.py and the other preprocessors emit chunk_bboxes, media_files, title, created_date, and appendix. Returning a different shape from one preprocessor is a downstream compatibility break waiting to happen.

Also applies to: 172-175

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/facade/json_processor.py` around lines 89 - 105, The JSON
processor currently defines attributes reg_date, bboxes, url and populates
self.data with keys reg_date, bboxes, url which diverges from the shared
metadata contract; update the attribute names and the self.data keys to match
the shared schema used in genon/preprocessor/module/utils/metadata.py and other
preprocessors: replace reg_date with created_date, bboxes with chunk_bboxes, url
with media_files, and add missing keys title and appendix in self.data (and
initialize corresponding attributes if needed, e.g., self.created_date,
self.chunk_bboxes, self.media_files, self.title, self.appendix) so the emitted
JSON shape matches the shared contract (also apply the same renames/keys for the
other occurrence around the later block referenced).
genon/preprocessor/module/utils/chunkers.py-130-143 (1)

130-143: ⚠️ Potential issue | 🟠 Major

Preserve document order for tables missing from iterate_items().

Prepending every missing table at index 0 reorders the document whenever an omitted table belongs to a later page. That corrupts the chunk text and the header/page associations carried downstream.
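One way to keep reading order, sketched here with hypothetical (kind, page) tuples rather than the chunker's real item objects, is to insert each recovered table at its page position instead of index 0:

```python
import bisect

# Items already in document order, keyed by page number (illustrative shapes).
items = [("text", 1), ("text", 2), ("text", 5)]
missing_tables = [("table", 3), ("table", 1)]

for table in missing_tables:
    pages = [page for _, page in items]
    # Insert before the first item on a later page, preserving reading order.
    items.insert(bisect.bisect_left(pages, table[1]), table)

print(items)  # pages stay nondecreasing: 1, 1, 2, 3, 5
```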

genon/preprocessor/module/intelligent_processor.py-1188-1193 (1)

1188-1193: ⚠️ Potential issue | 🟠 Major

enrichment() exits before enrich_document() ever runs.

The first return document makes the actual enrichment call unreachable, so the TOC/metadata enrichment configured in self.enrichment_options is currently disabled.

🐛 Proposed fix
     def enrichment(self, document: DoclingDocument, **kwargs: dict) -> DoclingDocument:
-        return document
-
-        # 새로운 enriched result 받기
         document = enrich_document(document, self.enrichment_options, **kwargs)
         return document
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/module/intelligent_processor.py` around lines 1188 - 1193,
The enrichment method currently returns immediately, making the enrich_document
call unreachable; remove the premature "return document" so that
enrichment(self, document: DoclingDocument, **kwargs) calls
enrich_document(document, self.enrichment_options, **kwargs) and returns that
enriched DoclingDocument; ensure the function signature and final return still
return a DoclingDocument and that self.enrichment_options is passed through to
enrich_document.
genon/preprocessor/facade/attachment_processor_guardrail.py-1469-1476 (1)

1469-1476: ⚠️ Potential issue | 🟠 Major

Filter request kwargs before constructing the text splitter.

__call__() forwards all request kwargs, but RecursiveCharacterTextSplitter(**kwargs) only accepts a small subset. Flags like save_images, include_wmf, or future API parameters will raise TypeError here and break attachment preprocessing.

🐛 Proposed fix
-        text_splitter = RecursiveCharacterTextSplitter(**kwargs)
+        splitter_kwargs = {
+            k: v for k, v in kwargs.items()
+            if k in ["chunk_size", "chunk_overlap", "separators", "length_function"]
+        }
+        text_splitter = RecursiveCharacterTextSplitter(**splitter_kwargs)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/facade/attachment_processor_guardrail.py` around lines
1469 - 1476, The split_documents method passes the entire request kwargs into
RecursiveCharacterTextSplitter causing TypeError for unknown params; before
instantiating RecursiveCharacterTextSplitter in split_documents, filter kwargs
down to the supported keys (e.g., allowed =
{"chunk_size","chunk_overlap","separators",...} or derive via
RecursiveCharacterTextSplitter signature) then set defaults like
chunk_size=20000 on that filtered dict and call
RecursiveCharacterTextSplitter(**filtered_kwargs) so only accepted params are
forwarded.
genon/preprocessor/facade/json_processor.py-311-329 (1)

311-329: ⚠️ Potential issue | 🟠 Major

Use real chunk counts and indices when composing vectors.

split_documents() can return multiple chunks, but every vector here is stamped with n_chunk_of_doc=1, i_chunk_on_doc=1, and page info (1, 1, 1). Multi-chunk JSON inputs will therefore carry incorrect ordering/count metadata.

🐛 Proposed fix
         global_metadata = dict(
-            n_chunk_of_doc=int(1),
-            n_page=int(1),
+            n_chunk_of_doc=len(chunks),
+            n_page=1,
             reg_date=datetime.now().isoformat(timespec='seconds') + 'Z',
             **first_chunk['metadata'],
         )
@@
-        for chunk in chunks:
+        for chunk_idx, chunk in enumerate(chunks):
             vector = (GenOSVectorMetaBuilder()
                         .set_text(chunk["text"])
-                        .set_page_info(1, 1, 1)
-                        .set_chunk_index(1)
+                        .set_page_info(1, chunk_idx, len(chunks))
+                        .set_chunk_index(chunk_idx)
                         .set_global_metadata(**global_metadata)
                         .set_bboxes(None)
                         ).build()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/facade/json_processor.py` around lines 311 - 329, The code
currently hardcodes chunk/page metadata (n_chunk_of_doc=1, page info (1,1,1),
chunk index 1) when building vectors; update json_processor.py so
global_metadata uses the real chunk count (e.g., n_chunk_of_doc = len(chunks) or
the actual total for that document), increment chunk_index_on_page for each
chunk and pass the real chunk index into GenOSVectorMetaBuilder.set_chunk_index,
and compute and pass correct page info into set_page_info (use current_page,
total_pages, and per-page index variables rather than fixed 1s) before calling
set_global_metadata and build so each vector carries correct ordering/count
metadata.
genon/preprocessor/module/utils/chunkers.py-614-658 (1)

614-658: ⚠️ Potential issue | 🟠 Major

List mutation here can skip later oversized sections.

range(len(sections_with_text)) is computed before you pop() and insert() into sections_with_text. If an early section splits into multiple pieces, some original tail sections are never revisited, so they can remain above max_tokens.
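The skip can be avoided with an index that re-checks the list length and advances past the inserted pieces; a toy version with plain token counts standing in for sections:

```python
# Toy token counts; anything >= MAX must be split into two halves.
MAX = 10
sections = [12, 3, 15, 4]

i = 0
while i < len(sections):           # length is re-evaluated every iteration
    if sections[i] < MAX:
        i += 1
        continue
    halves = [sections[i] // 2, sections[i] - sections[i] // 2]
    sections[i:i + 1] = halves     # splice in place of the oversized section
    i += len(halves)               # step past the pieces just inserted
print(sections)  # [6, 6, 3, 7, 8, 4]
```

With the original for-loop over a precomputed range, the 15 would never be revisited after the 12 is split.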

🐛 Proposed fix
-        if self.max_tokens > 0:
-            for i in range(len(sections_with_text)):
+        if self.max_tokens > 0:
+            i = 0
+            while i < len(sections_with_text):
                 text, items, h_infos, h_short = sections_with_text[i]
                 token_count = self._count_tokens(text)
                 if token_count < self.max_tokens:
-                    continue
+                    i += 1
+                    continue
@@
-                sections_with_text.pop(i)
-                for new_section in reversed(new_sections):
-                    sections_with_text.insert(i, new_section)
+                sections_with_text[i:i + 1] = new_sections
+                i += len(new_sections)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/module/utils/chunkers.py` around lines 614 - 658, The
for-loop over range(len(sections_with_text)) mutates sections_with_text
(pop/insert) and skips later entries; change it to a while-loop that
re-evaluates length at each iteration: initialize i = 0 and loop while i <
len(sections_with_text), load (text, items, h_infos, h_short) =
sections_with_text[i], compute token_count via self._count_tokens(text); if
token_count < self.max_tokens then i += 1; otherwise perform the existing
grouping (adjust_captions, adjust_pictures_in_tables), compute
item_token_counts, call split_items_evenly_by_tokens, build new_sections with
self._generate_section_text_with_heading, replace the original entry by popping
and inserting new_sections at index i, and then advance i by len(new_sections)
to continue processing the remaining sections without skipping any.
genon/preprocessor/facade/intelligent_processor_ocr.py-1341-1341 (1)

1341-1341: ⚠️ Potential issue | 🟠 Major

Cancellation guard exists but is not executed in the hot path.

assert_cancelled is defined, but the call in Line 1341 is commented out. Long OCR/chunking work will continue even after client disconnect.

Also applies to: 1382-1384

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/facade/intelligent_processor_ocr.py` at line 1341, The
cancellation guard assert_cancelled is defined but its awaited call was
commented out; re-enable and await assert_cancelled(request) at the hot-path
spots where long OCR/chunking work runs (the commented call near the current
assert_cancelled(request) and the nearby region around the other commented
calls) so work stops promptly on client disconnect; ensure these awaits occur
inside the async processing/loop functions handling OCR/chunking (where request
is in scope) and that the functions remain async so the await is effective.
genon/preprocessor/facade/intelligent_processor_ocr.py-848-849 (1)

848-849: ⚠️ Potential issue | 🟠 Major

page_chunk_counts should be request-scoped, not processor-scoped.

self.page_chunk_counts is mutable instance state and is incremented during split_documents but never reset per request. This can leak counts across requests and produce wrong n_chunk_of_page metadata (and race under concurrent use).
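A minimal sketch of request-scoped counting (the chunk shapes are illustrative, not the processor's real API):

```python
from collections import defaultdict

def count_chunks_per_page(chunks):
    # Recreated on every call, so no state leaks between requests
    # and concurrent requests cannot interleave counts.
    page_chunk_counts = defaultdict(int)
    for chunk in chunks:
        page_chunk_counts[chunk["page"]] += 1
    return dict(page_chunk_counts)

print(count_chunks_per_page([{"page": 1}, {"page": 1}, {"page": 2}]))  # {1: 2, 2: 1}
```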

🧭 Suggested fix
     async def __call__(self, request: Request, file_path: str, **kwargs: dict):
+        self.page_chunk_counts.clear()
         document: DoclingDocument = self.load_documents(file_path, **kwargs)

Also applies to: 987-997, 1103-1106

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/facade/intelligent_processor_ocr.py` around lines 848 -
849, self.page_chunk_counts is currently stored on the processor instance and
accumulates across requests; make it request-scoped by removing
self.page_chunk_counts and creating a local page_chunk_counts (e.g., a
defaultdict(int)) inside split_documents (or the top-level request handler that
calls split_documents), pass that local map into any helper methods that compute
or set n_chunk_of_page metadata, and ensure you use the local variable when
incrementing counts so it is reset on each request (this also avoids races under
concurrent requests).
genon/preprocessor/facade/intelligent_processor_ocr.py-713-733 (1)

713-733: ⚠️ Potential issue | 🟠 Major

Local metadata model duplicates shared code and already drifted (appendix missing).

This file redefines GenOSVectorMeta / builder instead of reusing genon/preprocessor/module/utils/metadata.py (where appendix exists at Lines 22-48). This creates schema inconsistency across preprocessors and risks metadata loss.

Also applies to: 735-833

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/facade/intelligent_processor_ocr.py` around lines 713 -
733, The file redefines GenOSVectorMeta locally causing schema drift (missing
appendix) instead of reusing the shared model; replace the local GenOSVectorMeta
definition with an import of the canonical model from
genon.preprocessor.module.utils.metadata (or subclass it) so the shared fields
including appendix are present, update any builder/consumer code in
intelligent_processor_ocr.py that referenced the local class to use the
imported/shared GenOSVectorMeta (or a thin subclass) to ensure identical schema
and avoid duplicated definitions.
genon/preprocessor/facade/intelligent_processor_ocr.py-1276-1278 (1)

1276-1278: ⚠️ Potential issue | 🟠 Major

Don’t suppress OCR failures with print(...); pass.

Lines 1276-1278 swallow all OCR errors, which makes partial failures invisible and hard to diagnose. At minimum, log structured context and raise a domain exception (or return an explicit failure signal).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/facade/intelligent_processor_ocr.py` around lines 1276 -
1278, Replace the bare "except Exception as e: print(...); pass" in the OCR
handling block in genon/preprocessor/facade/intelligent_processor_ocr.py with
structured logging and an explicit failure signal: use the module logger (e.g.,
logging.getLogger(__name__)) to log an error including contextual fields
(document id, page number, image metadata) and the exception details, then
either raise a domain-specific exception (e.g., OCRProcessingError) with the
original exception chained (raise OCRProcessingError("OCR failed for document
X", ...) from e) or return a clear failure result/object instead of swallowing
the error; update or add the OCRProcessingError class if needed and ensure
callers handle the new exception/return value.
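A minimal sketch of the logging-plus-chained-exception shape described above; `decode_cell` is a stand-in for the real OCR backend, and the context fields are illustrative:

```python
import logging

logger = logging.getLogger(__name__)

class OCRProcessingError(Exception):
    """Domain exception for OCR failures; the original cause is chained."""

def decode_cell(cell_bytes: bytes) -> str:
    # Stand-in for the real OCR call; fails on empty input.
    if not cell_bytes:
        raise ValueError("empty image")
    return cell_bytes.decode("utf-8", errors="replace")

def run_ocr(cell_bytes: bytes, *, doc_id: str, page_no: int) -> str:
    try:
        return decode_cell(cell_bytes)
    except Exception as e:
        # Structured context instead of print(...); pass
        logger.error("OCR failed doc=%s page=%s: %s", doc_id, page_no, e)
        raise OCRProcessingError(f"OCR failed for {doc_id} p.{page_no}") from e
```

Chaining with `from e` preserves the original traceback for diagnosis while giving callers a single exception type to handle.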
genon/preprocessor/facade/intelligent_processor_ocr.py-1311-1311 (1)

1311-1311: ⚠️ Potential issue | 🟠 Major

has_text_items table condition is inverted.

Line 1311 marks table content as present only when len(item.data.table_cells) == 0. For normal tables (cells > 0), this evaluates false and can incorrectly trigger the dummy "." fallback path.

✅ Suggested fix
-            if (isinstance(item, (TextItem, ListItem, CodeItem, SectionHeaderItem)) and item.text and item.text.strip()) or (isinstance(item, TableItem) and item.data and len(item.data.table_cells) == 0):
+            if (isinstance(item, (TextItem, ListItem, CodeItem, SectionHeaderItem)) and item.text and item.text.strip()) or (isinstance(item, TableItem) and item.data and len(item.data.table_cells) > 0):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/facade/intelligent_processor_ocr.py` at line 1311, The
condition that detects tables is inverted: in the compound if that checks item
types (TextItem, ListItem, CodeItem, SectionHeaderItem, TableItem) change the
TableItem branch to test for table_cells present (len(item.data.table_cells) >
0) instead of == 0 so real tables count as having text; update the expression
that references item.data.table_cells accordingly to restore correct
has_text_items behavior and avoid the dummy "." fallback.
genon/preprocessor/facade/intelligent_processor_ocr.py-483-488 (1)

483-488: ⚠️ Potential issue | 🟠 Major

Oversized table splitting is effectively disabled.

In Lines 483-488, the code detects token overflow but then forces split_tables = [table_only_text], so no split actually happens. This can still emit chunks above max_tokens and break downstream embedding/model limits.

✂️ Suggested fix
-                    # split_tables = self._split_table_text(table_only_text, 4096)
-                    split_tables = [table_only_text]
+                    split_tables = self._split_table_text(table_only_text, self.max_tokens)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/facade/intelligent_processor_ocr.py` around lines 483 -
488, When table_tokens > self.max_tokens replace the forced no-op split with a
real split: call the existing helper _split_table_text(table_only_text,
chunk_token_limit) instead of assigning split_tables = [table_only_text];
compute chunk_token_limit from self.max_tokens (or self.max_tokens minus any
per-chunk overhead) and assign its result to split_tables, and fall back to
[table_only_text] only if the splitter returns an empty list. Update the block
that currently references table_only_text, _split_table_text and split_tables so
oversized tables are actually chunked to respect self.max_tokens.
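A row-wise splitter along the lines of `_split_table_text` might look like the sketch below. The whitespace token count is only a proxy for the real tokenizer, and the function name is illustrative:

```python
def split_table_text(table_text: str, max_tokens: int) -> list[str]:
    def count_tokens(s: str) -> int:
        # Whitespace proxy; swap in the embedder's tokenizer in practice.
        return len(s.split())

    rows = table_text.splitlines()
    chunks: list[str] = []
    current: list[str] = []
    for row in rows:
        candidate = "\n".join(current + [row])
        if current and count_tokens(candidate) > max_tokens:
            chunks.append("\n".join(current))
            current = [row]
        else:
            current.append(row)
    if current:
        chunks.append("\n".join(current))
    return chunks or [table_text]
```

A single row larger than `max_tokens` is kept whole here; a production splitter would subdivide it further.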
genon/preprocessor/facade/intelligent_processor_ocr.py-1174-1178 (1)

1174-1178: ⚠️ Potential issue | 🟠 Major

Avoid blocking HTTP calls (requests.post) in the async request path.

__call__ is async, but OCR uses synchronous requests.post per cell. This can block the event loop and degrade latency/throughput under concurrent traffic.

#!/bin/bash
# Verify blocking network calls in async flow.
set -euo pipefail

rg -n -C3 'async def __call__|def ocr_all_table_cells|requests\.post\(' genon/preprocessor/facade/intelligent_processor_ocr.py

Also applies to: 1282-1296

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/facade/intelligent_processor_ocr.py` around lines 1174 -
1178, The code uses synchronous requests.post inside post_ocr_bytes (used from
async __call__ / ocr_all_table_cells), which blocks the event loop; replace the
sync call with an async HTTP client (e.g., aiohttp.ClientSession or
httpx.AsyncClient): create an async function post_ocr_bytes_async(img_bytes,
timeout=60) that performs the POST with the same HEADERS/payload and timeout,
returns the parsed dict, and update callers (ocr_all_table_cells and __call__)
to await post_ocr_bytes_async; ensure proper session lifecycle (reuse a session
if many calls) and propagate/handle HTTP errors the same way as the current r.ok
logic.
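Short of adopting httpx/aiohttp, the existing blocking poster can be offloaded to a worker thread so the loop keeps serving other requests. The injected `post` callable and the signature below are assumptions for the sketch, not the repo's API:

```python
import asyncio

async def post_ocr_bytes_async(img_bytes: bytes, post, timeout: float = 60.0) -> dict:
    # Run the blocking call (e.g. a requests.post wrapper) in a worker
    # thread and bound its total time with wait_for.
    return await asyncio.wait_for(asyncio.to_thread(post, img_bytes), timeout)
```

A shared `httpx.AsyncClient` per processor would avoid the thread hop entirely and is the better long-term fix.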
🟡 Minor comments (2)
genon/preprocessor/module/test.py-55-59 (1)

55-59: ⚠️ Potential issue | 🟡 Minor

Remove the committed debugger stop.

breakpoint() makes this runner hang in non-interactive environments right after writing result.json.

Proposed fix
 end = time.time()
 print(f"Processing time: {end - begin:.2f} seconds")
-
-
-breakpoint()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/module/test.py` around lines 55 - 59, The committed
debugger call breakpoint() at the end of genon/preprocessor/module/test.py
causes the runner to hang; remove the breakpoint() call (or replace it with a
non-blocking log statement) after the timing print (the end/ begin timing block)
so the test runner exits normally; ensure no other interactive debugger calls
remain in the module.
genon/preprocessor/facade/attachment_processor.py-130-131 (1)

130-131: ⚠️ Potential issue | 🟡 Minor

Keep the metadata in sync when you rewrite r.text.

After this replacement, n_char, n_word, and n_line still describe the original chunk. That leaves the vector payload internally inconsistent.

Proposed fix
                 if answer.startswith("[UNSAFE]"):
-                    r.text = "부적절한 텍스트가 포함되어 있으므로 해당 청크를 제거합니다."
+                    replacement = "부적절한 텍스트가 포함되어 있으므로 해당 청크를 제거합니다."
+                    r.text = replacement
+                    if hasattr(r, "n_char"):
+                        r.n_char = len(replacement)
+                    if hasattr(r, "n_word"):
+                        r.n_word = len(replacement.split())
+                    if hasattr(r, "n_line"):
+                        r.n_line = len(replacement.splitlines())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/facade/attachment_processor.py` around lines 130 - 131,
The replacement of r.text when answer.startswith("[UNSAFE]") leaves the metadata
(n_char, n_word, n_line) inconsistent; update those metadata fields to reflect
the new text after setting r.text (e.g., compute new character count, word
count, and line count from the replacement string) and store them back into
r.metadata (or the same metadata container used for n_char, n_word, n_line) so
the vector payload remains consistent with r.text.
🧹 Nitpick comments (3)
genon/preprocessor/scripts/register_image.sh (2)

163-163: Syntax anomaly: 2>/dev/null after string assignment.

The 2>/dev/null redirect after the closing quote of SQL_INSERT has no effect since this is a variable assignment, not a command execution. This appears to be leftover from previous code.

🧹 Remove the ineffective redirect
       VALUES
-        (LAST_INSERT_ID(), 'DOCKER_IMAGE', 1, 1, NOW(), NOW(), 1, 1);
-    " 2>/dev/null
+        (LAST_INSERT_ID(), 'DOCKER_IMAGE', 1, 1, NOW(), NOW(), 1, 1);
+    "
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/scripts/register_image.sh` at line 163, Remove the stray
redirection following the SQL_INSERT assignment: locate the SQL_INSERT variable
assignment in register_image.sh and delete the trailing `2>/dev/null` that
appears after the closing quote (it's ineffective on assignments); ensure only
the intended string is assigned to SQL_INSERT and that no redirection tokens
remain adjacent to the quoted value.

166-166: Avoid -it flags in non-interactive kubectl exec.

Using -it (interactive TTY) in a script context can cause issues with output parsing and may fail in CI/CD environments. Use -i only or omit both flags when capturing output.

♻️ Proposed fix
   IMAGE_ID=$(
-    kubectl exec -it "${MARIADB_POD}" -n "${K8S_NAMESPACE}" -- \
+    kubectl exec -i "${MARIADB_POD}" -n "${K8S_NAMESPACE}" -- \
       mysql -u "${MYSQL_USER}" -p"${MYSQL_PASS}" llmops -se \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/scripts/register_image.sh` at line 166, In the kubectl
exec invocation that uses the variables MARIADB_POD and K8S_NAMESPACE, remove
the interactive TTY flag (-t) so the script doesn't use -it in non-interactive
contexts; replace "-it" with either "-i" (if stdin must be kept open) or omit
both flags to capture output reliably in CI/CD. Locate the kubectl exec line in
register_image.sh and update the flags accordingly while keeping the
pod/namespace variables and remaining command arguments unchanged.
genon/README.md (1)

81-84: Add language specifier to fenced code block.

The code block is missing a language identifier, which affects syntax highlighting and linting compliance.

📝 Proposed fix
-```
+```shell
 docker save mncregistry:30500/doc-parser-ocr:1.3.3-komipo | gzip > doc-parser-ocr.tar.gz
 gunzip -c doc-parser-ocr.tar.gz | docker load
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @genon/README.md around lines 81 - 84, The fenced code block containing the
two commands starting with "docker save
mncregistry:30500/doc-parser-ocr:1.3.3-komipo | gzip > doc-parser-ocr.tar.gz"
and "gunzip -c doc-parser-ocr.tar.gz | docker load" in README.md needs a
language specifier; change the opening triple backticks to include a language
(e.g., shell) so the opening fence reads ```shell, enabling correct
syntax highlighting and satisfy linting rules.


</details>

</blockquote></details>

</blockquote></details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: defaults

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `d14a1a4c-0e58-4938-92e4-4aabcc22ca7a`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between bdcd8184ca01c43395d775e6d0828702d0e6e7e6 and b5877fe12fa2066a9a7fec8441575ad1e0b4ffbb.

</details>

<details>
<summary>📒 Files selected for processing (19)</summary>

* `build-script/doc-parser-build.config`
* `build-script/paddle-ocr-build.config`
* `genon/README.md`
* `genon/preprocessor/facade/attachment_processor.py`
* `genon/preprocessor/facade/attachment_processor_guardrail.py`
* `genon/preprocessor/facade/intelligent_processor_ocr.py`
* `genon/preprocessor/facade/json_processor.py`
* `genon/preprocessor/facade/legacy/적재용(외부)_ocr.py`
* `genon/preprocessor/facade/oneagent_processor.py`
* `genon/preprocessor/module/base_processor.py`
* `genon/preprocessor/module/intelligent_processor.py`
* `genon/preprocessor/module/test.py`
* `genon/preprocessor/module/test_processor.py`
* `genon/preprocessor/module/utils/chunkers.py`
* `genon/preprocessor/module/utils/genos_util.py`
* `genon/preprocessor/module/utils/metadata.py`
* `genon/preprocessor/module/utils/util.py`
* `genon/preprocessor/scripts/register.config`
* `genon/preprocessor/scripts/register_image.sh`

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment on lines +100 to +101
GUARDRAIL_WORKFLOW_ID = 694
GUARDRAIL_BEARER_TOKEN = '23c3898fe3264fd597961af23a68fe7c'

⚠️ Potential issue | 🔴 Critical

Remove and rotate the committed guardrail token.

This bearer token is now part of the repository history, so anyone with source access can call the workflow service. Load it from environment/secret storage instead and rotate the exposed token before merge.

🧰 Tools
🪛 Betterleaks (1.1.2)

[high] 101-101: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🪛 Ruff (0.15.10)

[error] 101-101: Possible hardcoded password assigned to: "GUARDRAIL_BEARER_TOKEN"

(S105)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/facade/attachment_processor_guardrail.py` around lines 100
- 101, The hardcoded secret GUARDRAIL_BEARER_TOKEN (and any other sensitive
value like GUARDRAIL_WORKFLOW_ID if it's sensitive) must be removed from source;
change code to read the bearer token from secure configuration/environment
(e.g., env var name GUARDRAIL_BEARER_TOKEN) and fail fast with a clear error if
it is missing, and ensure callers still use the GUARDRAIL_WORKFLOW_ID constant
only if non-sensitive; rotate the exposed token before merging so the leaked
value is invalidated and update documentation/CI to inject the new secret into
environments instead of committing it.
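The fail-fast environment lookup asked for above can be as small as this; the helper name is illustrative:

```python
import os

def require_secret(name: str) -> str:
    # Read a required secret (e.g. GUARDRAIL_BEARER_TOKEN) from the
    # environment and refuse to start without it.
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value
```

Failing at startup surfaces the misconfiguration immediately, instead of at the first guardrail call deep in a request.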

Comment on lines +109 to +128
def guardrail(func):
@wraps(func)
async def wrapper(*args, **kwargs):
result = await func(*args, **kwargs)
for r in result:
url = f"{GENOS_URL}/workflow/{GUARDRAIL_WORKFLOW_ID}"
headers = dict(Authorization=f"Bearer {GUARDRAIL_BEARER_TOKEN}")

if hasattr(r, "text"):
body = {'question': r.text}

res = requests.post(f'{url}/run/v2', json=body, headers=headers)

answer = res.json()['data']['text']

if answer.startswith("[UNSAFE]"):
r.text = "부적절한 텍스트가 포함되어 있으므로 해당 청크를 제거합니다."


return result

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

fd -type f -name "*attachment_processor_guardrail*" | head -20

Repository: genonai/doc_parser

Length of output: 232


🏁 Script executed:

cat -n genon/preprocessor/facade/attachment_processor_guardrail.py

Repository: genonai/doc_parser

Length of output: 50374


🏁 Script executed:

rg -n "guardrail" genon/preprocessor/facade/attachment_processor_guardrail.py | head -20

Repository: genonai/doc_parser

Length of output: 107


🏁 Script executed:

rg -A 5 "@guardrail" --max-count 10

Repository: genonai/doc_parser

Length of output: 1894


🏁 Script executed:

rg -B 5 -A 15 "requests.post" genon/preprocessor/facade/attachment_processor_guardrail.py | head -50

Repository: genonai/doc_parser

Length of output: 1584


🏁 Script executed:

rg "timeout" genon/preprocessor/facade/attachment_processor_guardrail.py

Repository: genonai/doc_parser

Length of output: 172


Replace synchronous HTTP request with async implementation or use executor to prevent event loop blocking.

The guardrail decorator wraps async functions but makes synchronous requests.post() calls serially inside a for loop without timeout or status validation. This blocks the entire event loop while waiting for the workflow service, stalling all concurrent requests. Use an async HTTP client like aiohttp or httpx, or wrap the blocking call with asyncio.to_thread(). Additionally, add a timeout (e.g., timeout=30) and validate HTTP status before parsing JSON to handle service failures gracefully.

🧰 Tools
🪛 Ruff (0.15.10)

[error] 120-120: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/facade/attachment_processor_guardrail.py` around lines 109
- 128, The guardrail decorator's wrapper currently makes blocking synchronous
requests.post calls inside an async loop (using GENOS_URL,
GUARDRAIL_WORKFLOW_ID, GUARDRAIL_BEARER_TOKEN and reading/modifying r.text),
which blocks the event loop; change the implementation to perform the HTTP call
asynchronously (use an async client like aiohttp or httpx async) or offload to a
thread via asyncio.to_thread, include a timeout (e.g., 30s) and check HTTP
status before calling res.json(), and handle network/errors by skipping or
marking the chunk instead of raising so wrapper keeps processing other items.

Comment on lines +1216 to +1235
async def __call__(self, request: Request, file_path: str, **kwargs: dict):
document: DoclingDocument = self.load_documents(file_path, **kwargs)
artifacts_dir, reference_path = self.get_paths(file_path)
document = document._with_pictures_refs(image_dir=artifacts_dir, reference_path=reference_path)

chunks: list[DocChunk] = self.split_documents(document, **kwargs)

vectors = []
if len(chunks) >= 1:
vectors: list[dict] = await self.compose_vectors(document, chunks, file_path, request, **kwargs)
else:
raise GenosServiceException(1, f"chunk length is 0")

text = ""
for vector in vectors:
if len(text) + len(vector.text) > 8192:
break
text += vector.text

return [vectors[0]]

⚠️ Potential issue | 🔴 Critical

Don't discard all but the first HWPX vector.

After building vectors, this branch always returns [vectors[0]]. Any additional chunks/pages from an HWPX document are silently lost.

🧰 Tools
🪛 Ruff (0.15.10)

[error] 1227-1227: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/facade/attachment_processor_guardrail.py` around lines
1216 - 1235, The __call__ method builds a list named vectors via compose_vectors
but then always returns only [vectors[0]], dropping remaining HWPX pages; change
the return logic to preserve all composed vectors (or the subset actually
included under the 8192-char accumulation) instead of returning only the first:
locate __call__, the vectors variable and the for loop that accumulates text and
either return the full vectors list or return the slice of vectors consumed by
the text-length limit so all chunk/page vectors are preserved (use
compose_vectors, vectors, and the accumulation loop to determine the correct
subset).

Comment on lines +1440 to +1446
try:
conv_result: ConversionResult = self.ocr_converter.convert(file_path, raises_on_error=True)
except Exception as e:
print("@@@@", e)
# conv_result: ConversionResult = self.ocr_second_converter.convert(file_path, raises_on_error=True)

return conv_result.document

⚠️ Potential issue | 🔴 Critical

Handle OCR conversion failures before returning conv_result.document.

If self.ocr_converter.convert() raises, the except block only logs and execution still falls through to return conv_result.document. That turns the original OCR error into UnboundLocalError.

🧰 Tools
🪛 Ruff (0.15.10)

[warning] 1442-1442: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/facade/attachment_processor_guardrail.py` around lines
1440 - 1446, The try/except around self.ocr_converter.convert() can leave
conv_result undefined on error and cause UnboundLocalError; update the except
block to handle failures before returning conv_result.document by either
re-raising the caught exception or attempting the fallback converter (call
self.ocr_second_converter.convert(file_path, raises_on_error=True) and assign to
conv_result) and only returning conv_result.document after conv_result is
guaranteed set; ensure any logged error includes the exception and that a proper
exception is raised or a sensible fallback return is returned if both converters
fail.
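The converter fallback can be sketched generically; `primary` and `fallback` stand in for `ocr_converter.convert` and the commented-out `ocr_second_converter.convert`:

```python
def convert_with_fallback(file_path: str, primary, fallback):
    # Either a converter succeeds and its result is returned, or the
    # failure is surfaced -- the result variable is never left unbound.
    try:
        return primary(file_path)
    except Exception:
        try:
            return fallback(file_path)
        except Exception as second_error:
            raise RuntimeError(f"both converters failed for {file_path}") from second_error
```

Because every path either returns or raises, the caller can never observe a half-initialized `conv_result`.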

Comment on lines +114 to +128
def guardrail(func):
@wraps(func)
async def wrapper(*args, **kwargs):
result = await func(*args, **kwargs)

for r in result:
url = f"{GENOS_URL}/api/gateway/workflow/{GUARDRAIL_WORKFLOW_ID}"
headers = dict(Authorization=f"Bearer {GUARDRAIL_BEARER_TOKEN}")

if hasattr(r, "text"):
body = {"question": r.text}

res = requests.post(f"{url}/run/v2", json=body, headers=headers)

answer = res.json()["data"]["text"]

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and get its structure
file_path="genon/preprocessor/facade/attachment_processor.py"
if [ -f "$file_path" ]; then
    echo "=== File exists ==="
    wc -l "$file_path"
    echo ""
    echo "=== Lines 100-140 (context around 114-128) ==="
    sed -n '100,140p' "$file_path"
else
    echo "File not found. Searching for similar files..."
    find . -name "*attachment_processor*" -type f 2>/dev/null
fi

Repository: genonai/doc_parser

Length of output: 1215


🏁 Script executed:

# Check imports at the top of the file
head -50 "genon/preprocessor/facade/attachment_processor.py"

Repository: genonai/doc_parser

Length of output: 1444


🏁 Script executed:

# Search for guardrail decorator usage and related async patterns
rg "def guardrail|asyncio\.|to_thread" "genon/preprocessor/facade/attachment_processor.py" -A 3 -B 1

Repository: genonai/doc_parser

Length of output: 963


🏁 Script executed:

# Check if requests is imported and how it's used elsewhere in the file
rg "import requests|requests\." "genon/preprocessor/facade/attachment_processor.py" -B 1 -A 1

Repository: genonai/doc_parser

Length of output: 361


Move the synchronous requests.post() call out of the async event loop.

The decorator makes a blocking network call for every chunk with no timeout or error handling. If the guardrail service is slow or fails, it will stall the entire async preprocessing pipeline.

Use asyncio.to_thread() to run the POST in a thread pool, add a timeout, and safely handle the response:

Suggested implementation
def guardrail(func):
    @wraps(func)
    async def wrapper(*args, **kwargs):
        result = await func(*args, **kwargs)
+        if not GENOS_URL or not GUARDRAIL_BEARER_TOKEN:
+            return result
 
         for r in result:
             url = f"{GENOS_URL}/api/gateway/workflow/{GUARDRAIL_WORKFLOW_ID}"
             headers = dict(Authorization=f"Bearer {GUARDRAIL_BEARER_TOKEN}")
 
             if hasattr(r, "text"):
                 body = {"question": r.text}
 
-                res = requests.post(f"{url}/run/v2", json=body, headers=headers)
-
-                answer = res.json()["data"]["text"]
+                res = await asyncio.to_thread(
+                    requests.post,
+                    f"{url}/run/v2",
+                    json=body,
+                    headers=headers,
+                    timeout=5,
+                )
+                res.raise_for_status()
+                payload = res.json()
+                answer = ((payload.get("data") or {}).get("text") or "")
🧰 Tools
🪛 Ruff (0.15.10)

[error] 126-126: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/facade/attachment_processor.py` around lines 114 - 128,
The guardrail decorator's wrapper currently performs synchronous requests.post
calls inside the async loop; change it to run the network call in a thread via
asyncio.to_thread (or run_in_executor) and wrap that call with asyncio.wait_for
to enforce a timeout, plus add try/except to catch RequestException, timeouts,
and JSON/key errors; specifically update the loop in wrapper (where it builds
url/headers/body and calls requests.post then res.json()) to call a helper that
performs requests.post in a thread, await it with a timeout, check
res.status_code before accessing res.json(), and handle/log exceptions and
missing "data"/"text" keys so a failed guardrail call does not block or crash
the async preprocessing pipeline.

Comment on lines +1566 to +1579
# @@@@ 성민: OneAgent 연동용
if "uploads" in kwargs.keys():
import base64
uploads = kwargs.get("uploads", None)[0]

# @@@@ 전처리기 파일 저장 경로
folder = "/nfs-root/tmp/uploads"

decoded = base64.b64decode(uploads['data'].split(",", 1)[1])
file_path = os.path.join(folder, uploads['name'])

with open(file_path, 'wb') as f:
f.write(decoded)


⚠️ Potential issue | 🔴 Critical

Sanitize uploaded filenames before writing them.

uploads["name"] is joined directly into /nfs-root/tmp/uploads. A filename containing ../ can escape that directory and overwrite arbitrary files on the worker.

🐛 Proposed fix
         if "uploads" in kwargs.keys():
             import base64
             uploads = kwargs.get("uploads", None)[0]
@@
             # @@@@ 전처리기 파일 저장 경로
             folder = "/nfs-root/tmp/uploads"
+            os.makedirs(folder, exist_ok=True)
 
             decoded = base64.b64decode(uploads['data'].split(",", 1)[1])
-            file_path = os.path.join(folder, uploads['name'])
+            safe_name = f"{uuid.uuid4()}_{os.path.basename(uploads['name'])}"
+            file_path = os.path.join(folder, safe_name)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/facade/oneagent_processor.py` around lines 1566 - 1579,
The code writes an uploaded file using uploads['name'] directly into folder and
is vulnerable to path traversal; update the block handling uploads (variables:
uploads, folder, file_path) to sanitize and validate the filename before
joining: obtain a safe base name (e.g., use os.path.basename or a
secure_filename routine), reject or normalize names that contain path separators
or resolve outside the target folder, and/or replace the name with a generated
safe token (UUID) plus the original extension; then construct file_path with
os.path.join(folder, safe_name) and ensure folder exists and file_path resides
under folder (e.g., by comparing resolved absolute paths) before writing the
decoded data.
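A containment check on top of the UUID-plus-basename approach suggested above; the helper name and signature are illustrative:

```python
import os
import uuid

def safe_upload_path(folder: str, name: str) -> str:
    # Drop any directory components from the client-supplied name and
    # prefix a UUID, then verify the result still resolves under folder.
    base = os.path.basename(name.replace("\\", "/"))
    candidate = os.path.join(folder, f"{uuid.uuid4().hex}_{base}")
    if not os.path.realpath(candidate).startswith(os.path.realpath(folder) + os.sep):
        raise ValueError(f"unsafe upload name: {name!r}")
    return candidate
```

The `realpath` comparison also defends against symlink tricks, not just literal `../` components.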

Comment on lines +1031 to +1040
# # Mistral-Small-3.1-24B-Instruct-2503, 운영망
# toc_api_base_url="https://genos.genon.ai:3443/api/gateway/rep/serving/502/v1/chat/completions",
# metadata_api_base_url="https://genos.genon.ai:3443/api/gateway/rep/serving/502/v1/chat/completions",
# toc_api_key="022653a3743849e299f19f19d323490b",
# metadata_api_key="022653a3743849e299f19f19d323490b",
# # Mistral-Small-3.1-24B-Instruct-2503, 한국은행 클러스터
# # toc_api_base_url="http://llmops-gateway-api-service:8080/serving/13/31/v1/chat/completions",
# # metadata_api_base_url="http://llmops-gateway-api-service:8080/serving/13/31/v1/chat/completions",
# # toc_api_key="9e32423947fd4a5da07a28962fe88487",
# # metadata_api_key="9e32423947fd4a5da07a28962fe88487",

⚠️ Potential issue | 🔴 Critical

Remove and rotate the API keys in comments.

Commented credentials are still committed secrets and remain readable in repository history and logs. These values need to be removed from source and rotated before merge.

🧰 Tools
🪛 Betterleaks (1.1.2)

[high] 1034-1034: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)


[high] 1035-1035: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)


[high] 1039-1039: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)


[high] 1040-1040: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/module/intelligent_processor.py` around lines 1031 - 1040,
Remove the hard-coded and commented API credentials and keys (the commented
toc_api_key and metadata_api_key and any commented
toc_api_base_url/metadata_api_base_url entries) from the source in
intelligent_processor.py, replacing them with non-secret placeholders or
references to environment/config variables used by your app (e.g., use
TOC_API_KEY/METADATA_API_KEY env names) and update any documentation comments
accordingly; delete the historical secret values from comments, ensure no other
commented credentials remain in the file (search for toc_api_key,
metadata_api_key, toc_api_base_url, metadata_api_base_url), and confirm that the
real keys are rotated/invalidated before merging.

Comment on lines +1282 to +1288
def get_media_files(self, doc_items: list):
temp_list = []
for item in doc_items:
if isinstance(item, PictureItem):
name = path.rsplit("/", 1)[-1]
temp_list.append({"path": path, "name": name})
return temp_list

⚠️ Potential issue | 🔴 Critical

get_media_files() will crash on the first picture.

path is never defined before path.rsplit(...). When upload_files is enabled and a chunk contains an image, this raises NameError from compose_vectors().

🐛 Proposed fix
     def get_media_files(self, doc_items: list):
         temp_list = []
         for item in doc_items:
             if isinstance(item, PictureItem):
+                path = str(item.image.uri)
                 name = path.rsplit("/", 1)[-1]
                 temp_list.append({"path": path, "name": name})
         return temp_list
🧰 Tools
🪛 Ruff (0.15.10)

[error] 1286-1286: Undefined name path

(F821)


[error] 1287-1287: Undefined name path

(F821)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/module/intelligent_processor.py` around lines 1282 - 1288,
get_media_files currently uses an undefined variable path which causes a
NameError; update the function (get_media_files) to read the path from the
PictureItem instance (e.g., item.path or item.src depending on the PictureItem
API), skip items with no path, and compute the filename with os.path.basename
(or path.rsplit("/",1)[-1]) before appending {"path": path, "name": name};
ensure the code references the correct attribute on PictureItem and adds a
defensive check for missing/empty paths so compose_vectors doesn't crash when
images are present.

Comment on lines +345 to +347
if upload_files:
    file_list = self.get_media_files(chunk.meta.doc_items)
    upload_tasks.append(asyncio.create_task(upload_files(file_list, request=request)))

⚠️ Potential issue | 🔴 Critical

This upload path calls a method that does not exist.

GenOSVectorMetaBuilder defines set_media_files(...), but not get_media_files(...). As soon as upload_files is available, this branch raises AttributeError and breaks the whole request.

Proposed fix
+    def get_media_files(self, doc_items: list) -> list[dict]:
+        files = []
+        for item in doc_items:
+            if isinstance(item, PictureItem) and item.image is not None:
+                path = str(item.image.uri)
+                name = path.rsplit("/", 1)[-1]
+                files.append({"path": path, "name": name})
+        return files
+
@@
             chunk_index_on_page += 1
             if upload_files:
                 file_list = self.get_media_files(chunk.meta.doc_items)
                 upload_tasks.append(asyncio.create_task(upload_files(file_list, request=request)))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/module/utils/metadata.py` around lines 345 - 347, The code
calls a non-existent method get_media_files which raises AttributeError; either
implement get_media_files(doc_items) on GenOSVectorMetaBuilder to return the
list of media file paths extracted from chunk.meta.doc_items, or change this
branch to use the existing set_media_files API (e.g., call
self.set_media_files(chunk.meta.doc_items) and then read the list from the
attribute that set_media_files populates) so that upload_files(file_list,
request=request) receives a proper list; update the upload_tasks creation
(asyncio.create_task(upload_files(file_list, request=request))) to use the valid
file_list source and ensure file_list is an actual list of file paths/objects
acceptable to upload_files.
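For context, the surrounding pattern — scheduling one upload task per chunk and awaiting them together — looks roughly like this; upload_files here is a stand-in coroutine, not the real GenOS helper:

```python
import asyncio

async def upload_files(file_list, request=None):
    # Stand-in for the real upload helper; just reports what it received.
    return {"uploaded": len(file_list)}

async def process_chunks(chunks):
    upload_tasks = []
    for file_list in chunks:
        if file_list:  # only schedule an upload when the chunk has media files
            upload_tasks.append(asyncio.create_task(upload_files(file_list)))
    # Await every scheduled upload so failures surface before the request returns.
    return await asyncio.gather(*upload_tasks)

results = asyncio.run(process_chunks([[{"path": "a.png"}], [], [{"path": "b.png"}]]))
```

Whatever method produces file_list, it must yield a plain list of file descriptors before the task is created, which is exactly what the missing get_media_files() was supposed to do.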

Comment on lines +115 to +118
def convert_md_to_pdf(md_path):
    """Convert a Markdown file to PDF"""
    install_packages(["chardet"])
    import chardet

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Find the util.py file
fd -t f "util.py" | head -20

Repository: genonai/doc_parser

Length of output: 147


🏁 Script executed:

# Read the relevant file and lines
if [ -f "genon/preprocessor/module/utils/util.py" ]; then
  cat -n "genon/preprocessor/module/utils/util.py" | sed -n '110,125p'
fi



🏁 Script executed:

# Check for install_packages definition or import in the file
rg "install_packages" genon/preprocessor/module/utils/util.py -B 2 -A 2



🏁 Script executed:

# Also check for any imports at the top of the file
head -50 "genon/preprocessor/module/utils/util.py" | cat -n



🏁 Script executed:

# Search entire file for install_packages
rg "install_packages" genon/preprocessor/module/utils/util.py



🏁 Script executed:

# Check if install_packages might be defined in genos_util.py or imported
rg "install_packages" genon/preprocessor/module/utils/



🏁 Script executed:

# Check broader search for install_packages in the codebase
rg "def install_packages|import.*install_packages" --type py | head -20



🏁 Script executed:

# Check which install_packages definitions exist and their context
rg "def install_packages" -A 5 genon/preprocessor/facade/ | head -40



🏁 Script executed:

# Check entire convert_md_to_pdf function for context
sed -n '115,160p' genon/preprocessor/module/utils/util.py | cat -n



🏁 Script executed:

# Check if util.py imports anything from facade modules
rg "from.*facade|import.*facade" genon/preprocessor/module/utils/util.py



convert_md_to_pdf() crashes immediately with NameError due to undefined install_packages.

Line 117 calls install_packages(["chardet"]), but this function is neither defined nor imported in this module. The function will fail before any conversion work occurs.

The install_packages() function exists in other facade modules (e.g., genon/preprocessor/facade/attachment_processor.py), but is not available in this utility module. Either import it from the appropriate facade module or remove the call if dynamic package installation is not needed.

🧰 Tools
🪛 Ruff (0.15.10)

[error] 117-117: Undefined name install_packages

(F821)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/module/utils/util.py` around lines 115 - 118, The
convert_md_to_pdf function calls install_packages(["chardet"]) but
install_packages is not defined or imported, causing a NameError; fix by either
importing install_packages from its provider (e.g., the facade module where it's
defined) and using that import in convert_md_to_pdf, or remove the dynamic
install call if not needed and rely on a normal dependency declaration; update
the top of genon/preprocessor/module/utils/util.py to import the
install_packages symbol (or delete the call inside convert_md_to_pdf) and ensure
any side-effects (like error handling) are preserved.
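The install_packages helper referenced here lives in the facade modules; as an illustration of the pattern only (an assumption, not the project's actual implementation), such a dynamic-install helper typically checks importability before shelling out to pip:

```python
import importlib
import subprocess
import sys

def install_packages(packages: list[str]) -> None:
    """Install any listed packages that are not already importable.

    Sketch of the facade-module helper this file should import;
    the real implementation may differ.
    """
    for pkg in packages:
        try:
            importlib.import_module(pkg)
        except ImportError:
            # Use the running interpreter's pip so the install lands
            # in the correct environment.
            subprocess.check_call([sys.executable, "-m", "pip", "install", pkg])
```

Importing this symbol at the top of util.py (or dropping the call in favor of a declared dependency on chardet) resolves the NameError either way.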
